feat: programmatic tool calling (the script tool)#50
Conversation
…ag registry)
Add an opt-in `script` tool (TSFORGE_SCRIPT=1) that lets the model write ONE
TypeScript program calling tools through generated `./tsforge-tools` stubs,
collapsing a mechanical multi-step tool chain into a single model turn. Only the
script's stdout returns to the model — intermediate results never enter context.
Mechanism: doScript writes the stub module + the model's code to a temp dir,
starts a loopback RPC server (Bun.serve on 127.0.0.1, one-time token), and runs
`bun run script.ts`. Each stub call POSTs {tool,args} back to the server, which
dispatches through the EXISTING executeTool chokepoint — so scope, the unified
policy, the write-guard, mutation accounting, and the gate all still apply. The
model gains ergonomics, not new powers (same trust level as `run`).
Bounded by a wall-clock timeout (kill), a per-script tool-call cap, and output
condensing. The RPC subset excludes the scaffolds, the dependency installer,
yield, and `script` itself (no recursion); requests are token-gated and
serialized so concurrent stub calls can't interleave a mutation. Not advertised
in plan mode and rejected at dispatch there, so the "no writes while planning"
guarantee holds.
Also introduces TOOL_SPECS in agent.constants.ts — one source of truth for
per-tool flags (readOnly, scriptExposable) that READ_ONLY_TOOL_NAMES and the
script-exposable subset now derive from, replacing hand-kept sets.
Tests: 15 new (stub generation, single-turn batching + call accounting, real
in/out-of-scope writes through executeTool, plan-mode rejection, call cap,
timeout kill, token/recursion/non-exposable guards, serialization, registry
equivalence). Full `bun run validate` green (1616 pass).
Eval validation (A/B: TSFORGE_SCRIPT off vs on, TTSR + token cost at
equal-or-better gate pass-rate) to follow — the feature ships only if the sweep
shows a real, non-regressing cost reduction.
There was a problem hiding this comment.
Code Review
This pull request introduces a new script tool that enables programmatic tool calling by executing a TypeScript program to batch multiple tool calls into a single turn. The feedback highlights two critical issues: first, creating the temporary directory in the system temp folder breaks module resolution for project dependencies and can leak resources on server initialization errors; second, executing multiple writes within a single script turn bypasses the write-guard and touched tracking because the current tracking mechanism only records the last written file.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
…r-use The first A/B showed `script` was neutral-to-negative on create-heavy tasks: the model reached for it on trivial independent creates (which it already batches into one turn), adding a cycle. Two changes fix and prove it: 1. Retarget the guidance: use `script` ONLY when the change to many files DEPENDS on first reading each file (a read→act loop the model otherwise splits into a read turn + an edit turn). Explicitly steer away from independent batch-creates/edits. This kills the over-use. 2. Add the `migrate` eval seed — a brownfield codemod (8 services, each edited using a tier read from its own header comment) — the read-dependent shape where PTC should help. A/B (DeepSeek, temp 0): - migrate (read-dependent codemod), pooled n=20/variant: off 60% pass, ~3.0 cyc, ~17s — stuck ~40% of runs on 95% pass, ~1.7 cyc, ~13s — stuck ~5% of runs (Fisher p≈0.008) - simple controls (validators/fixtures/handlers): on == off cycles, equal or slightly faster, equal quality — over-use regression gone. Given a real win on its target shape and no regression elsewhere, flip `script` to DEFAULT-ON with a `TSFORGE_NO_SCRIPT` kill switch (matching NO_LSP/NO_GIT) — no opt-in flag for users to think about. It makes no network calls, so default-on keeps eval sweeps deterministic. The sweep's `script` A/B dimension now toggles `TSFORGE_NO_SCRIPT` (inverted, like `git`). Tests updated for default-on (gating: present by default, withheld under TSFORGE_NO_SCRIPT). Full `bun run validate` green (1616 pass).
Eval: proven, now default-onTuned the tool after the first A/B showed over-use on trivial tasks, then re-measured (DeepSeek, temp 0). Win — read-dependent multi-file codemod (
Fisher exact p≈0.008. Doing the codemod manually makes the model thrash and stall ~2 in 5 runs; a script makes it reliable, ~38% fewer cycles, ~23% faster, same quality. No regression — simple controls ( Decision: flipped Full |
…s (PR #50 review) Two critical issues from Gemini review: 1. The script ran from a temp dir under the system tmpdir, so a script importing a project dependency (zod, etc.) failed module resolution, and the dir leaked if startRpcServer threw before the try block. Create the temp dir inside ctx.cwd (hidden .tsforge-script-* prefix so eslint/tsc ignore it) so Node/Bun resolution walks up to the workspace node_modules + relative imports, and move the server start inside try so the dir is always cleaned up. 2. runToolCalls tracked a single wrote.path, overwritten per edit/create event — so a script that writes N files only recorded the LAST in touched and only write-guarded that one; the rest bypassed the write-guard and change-scoped rules (test-sibling-required). Collect ALL written in-scope paths in a Set and recordTouched + write-guard each. Tests: script resolves a workspace node_modules dep; a 3-file script records all three in touched (state.edits=3) and leaves no temp dir behind. validate green (1618 pass).
What
Adds an opt-in `script` tool (`TSFORGE_SCRIPT=1`) — Programmatic Tool Calling. The model writes ONE TypeScript program that calls tools through generated `./tsforge-tools` stubs, collapsing a mechanical multi-step tool chain into a single model turn. Only the script's stdout returns to the model; intermediate results never enter context.
Why
Attacks the one axis the correctness gate is blind to — token + latency cost. Exploration/multi-file work that today costs N model turns (read 8 files, fetch+compare packages, transform-then-write across files) becomes one turn. Inspired by hermes-agent's PTC.
How (no new powers, just ergonomics)
Each stub call POSTs `{tool,args}` to a loopback RPC server that dispatches through the existing `executeTool` chokepoint — so scope, the unified policy, the write-guard, mutation accounting, and the gate all still apply. Same trust level as `run`. Bounded by a wall-clock timeout (kill), a per-script call cap, and output condensing. RPC subset excludes scaffolds/installer/yield and `script` itself (no recursion); token-gated and serialized. Not advertised in plan mode and rejected at dispatch there.
Also adds `TOOL_SPECS` — one source of truth for per-tool flags (`readOnly`, `scriptExposable`); `READ_ONLY_TOOL_NAMES` + the script-exposable subset now derive from it.
Tests
15 new (`script-tool.test.ts` + gating/accounting): stub generation, single-turn batching + call accounting, real in/out-of-scope writes through `executeTool`, plan-mode rejection, call cap, timeout kill, token/recursion/non-exposable guards, serialization, registry equivalence. Full `bun run validate` green (1616 pass).
Eval status (gating this PR)
🤖 Generated with Claude Code